ABSTRACT


An  on-line  system  for  text  processing  is  currently  under 
development.  For  input,  an  analyst  scans  texts  on  a  display 
and  selects  information  in  the  form  of  statements  which  then 
are  processed  linguistically  and  stored  as  logical  content 
representations. Once relevant information has been located, the
portion of text from which it was derived can be recovered.
Collected text passages can be edited into a report, with
commentary and interpretation added by the analyst.

INTRODUCTION


SAFARI  is  an  on-line  system  for  text  processing  designed  to  be  used 
by  a  group  of  information  analysts  for  building  their  own  personal  files. 
The  system  allows  an  analyst  to  scan  through  documents  displayed  on  a 
CRT  and  to  select  relevant  facts  which  then  are  processed  linguistically 
and stored in the computer in the form of logical content
representations. To locate information, the analyst can initiate a search through
the  stored  representations  to  identify  those  with  relevant  content. 

When  an  item  of  interest  is  found,  the  original  text  from  which  it  was 
derived  can  be  recovered.  On-line  editing  capabilities  allow  the 
analyst  to  take  all  or  any  parts  of  the  recovered  texts,  add  commentary 
and  interpretation,  and  generate  reports  about  their  content. 

A preliminary, simplified model of SAFARI was written for the IBM
7030 computer in a time-shared environment. Each analyst station
contained a large DD-13 display console with lightgun, a typewriter, and a
printer. The system was programmed in TREET, a general-purpose
list-processing system particularly well suited for on-line use. Work is
now in progress on the implementation of an augmented system in a
time-shared IBM/360 facility using less costly displays. We expect to
construct successively more complex models that will reflect an increasing
sophistication in the analytic components and that will adapt
progressively to the requirements of the analyst.

DESIGN PRINCIPLES


The design of SAFARI has had two guiding principles: (1) linguistic
analysis and logical formalization must be basic components;
(2) man/machine interaction is essential.

The  stress  on  linguistics  and  logic  is  based  on  our  conviction 
that  the  progressive  mechanization  of  procedures  for  handling  textual 
information  content  can  be  accomplished  most  effectively  through  such 
techniques.  The  development  of  transformational  grammar,  in  particular, 
has  led  to  increasing  insights  into  the  syntax  and  semantics  of  natural 
language. Concomitantly, recent non-standard elaborations of the
functional calculus have resulted in the use of complex predicate-argument
structures  as  normalized  representations  of  the  semantic  content  of 
language.  Furthermore,  current  research  within  transformational  theory 
is  leading  to  a  characterization  of  the  base  or  underlying  structure  of 
a  sentence  which  is  directly  interpretable  in  terms  of  these  special 
predicate-argument relations. This convergence of the results of
linguistic research and of research in applied logic is responsible for
our  confidence  in  the  validity  of  the  first  principle. 

The  importance  of  man/machine  interaction  for  the  design  of  SAFARI 
can  be  understood  in  relation  to  a  number  of  observations.  First,  the 
analyst  must  assume  part  of  the  burden  of  interpreting  the  text;  the 
depth  of  analysis  necessary  for  representing  the  information  content 
through  linguistic  and  logical  techniques  cannot  yet  be  achieved  by 
wholly  mechanical  means.  Actually,  we  do  not  know  what  the  appropriate 
depth  is;  however,  even  assuming  that  our  current  level  of  theoretical 
understanding  were  adequate,  it  is  not  clear  that  computer  procedures 
can be specified to implement that understanding in an efficient manner.
In  successive  models  of  the  system  we  intend  to  shift  the  responsibility 
for  providing  the  mechanics  of  analysis  from  the  analyst  to  the  computer 
as  more  sophisticated  procedures  become  available. 

Second,  not  all  of  the  sentences  in  a  document  are  meaningful,  and 
many  of  them  are  redundant.  Furthermore,  two  analysts  are  unlikely  to 
agree  on  the  relevance  or  importance  of  all  of  the  information  contained 
in  the  document,  although,  hopefully,  if  they  are  interested  in  the  same 
topics,  there  would  be  substantial  areas  of  agreement.  If  two  analysts 
are  reading  the  same  document  but  have  different  interests,  it  should 
be  possible  for  each  to  identify  the  content  relevant  to  his  work  and 
ignore  the  rest. 

Third,  not  all  of  the  information  stored  in  an  analyst's  personal 
files  is  relevant  to  him  at  a  particular  time.  It  should  be  possible 
for  him  to  be  selective  with  respect  to  that  information.  It  also 
should  be  possible  for  him  to  reorganize  and  restructure  the  files  to 
satisfy  new  interests. 

Other  observations  could  be  mustered  to  support  the  desirability 
of  incorporating  man/machine  interaction  into  the  SAFARI  design,  but 
the  foregoing  are  illustrative. 

THE  PRELIMINARY  MODEL 

The model system can be regarded as having two phases: representation
and recovery. The input to the representation phase is text, such as a
message, a news article, or a report. After it has been read into the
computer,  the  text  is  available  for  display  on  a  CRT  to  the  analyst 
who  reads  the  text  and  extracts  from  it  the  salient  facts  which  he 
wishes to store. He then phrases these facts in simple English
declarative  sentences,  which  are  called  kernels,  and  submits  them  to 
the  computer  for  processing  and  storage.  A  kernel  may  be  constructed 
on  the  display  by  lightgunning,  a  word  at  a  time,  the  component  words 
from  the  displayed  text,  using  the  typewriter  to  introduce  desired  words 
not  present  in  that  text.  Alternatively,  the  entire  kernel  itself  may  be 
typed  in.  When  a  kernel  is  complete,  the  analyst  lightguns  the  command 
"process," and programs for syntactic and semantic interpretation
automatically  convert  the  sentence  into  a  content  representation.  The 
analyst  can  then  begin  to  construct  another  kernel. 

If  for  any  reason  a  kernel  cannot  be  processed,  the  system  provides 
the  analyst  both  with  information  about  the  nature  of  the  problem  and 
with the basis for correcting it. The nature of this interaction
between the analyst and the system can be appreciated best in the
context of a more detailed description of the procedures involved.

The  first  step  in  the  machine  processing  of  a  kernel  is  a  lexical 
lookup  of  every  word  in  the  kernel.  If  any  words  not  in  the  machine's 
current lexicon are encountered, the analyst is asked to classify them.
At present, the classification is merely part-of-speech assignment. The
list of classes is displayed, and the analyst lightguns the appropriate
one. This information is internalized by the machine and retained for
future reference. When it has classifications for all the words in a
kernel, the machine attempts to parse it according to the grammar of the
system.
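
The lookup step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the TREET implementation: the class list `CLASSES`, the function name `lookup_kernel`, and the `ask_analyst` callback (standing in for the lightgun selection) are all hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical part-of-speech classes; the source does not enumerate them.
CLASSES = ("noun", "verb", "adjective", "preposition", "determiner")


def lookup_kernel(words: List[str],
                  lexicon: Dict[str, Set[str]],
                  ask_analyst: Callable[[str], str]) -> Dict[str, Set[str]]:
    """Look up every word of a kernel, asking the analyst about unknown words.

    New classifications are written back into `lexicon`, so the same word
    is not asked about again in later kernels.
    """
    classified: Dict[str, Set[str]] = {}
    for word in words:
        key = word.lower()
        if key not in lexicon:
            chosen = ask_analyst(key)  # stands in for the lightgun selection
            if chosen not in CLASSES:
                raise ValueError(f"unknown class {chosen!r} for word {key!r}")
            lexicon[key] = {chosen}    # retained for future reference
        classified[key] = lexicon[key]
    return classified
```

Passing the analyst interaction in as a callback keeps the lookup logic separate from the display hardware, which is one plausible way to arrange it.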

If no parsing is obtained, there are two possible reasons. Either
the kernel is not acceptable to the grammar, in which case it must be
rephrased, or there is a wrong lexical classification in terms of the
current kernel (for example, "supply" may be listed in the lexicon only
as a verb, while in the current kernel it is being used as a noun).
To ascertain whether the latter is the case, the machine displays all
of the words in the kernel with their associated classifications. The
analyst scans this list and, if a classification is incomplete, supplies
the necessary additional information. The lexicon is updated and the
kernel reprocessed.

If  more  than  one  parsing  is  obtained,  the  kernel  is  ambiguous  and 
the  analyst  is  informed  of  this  fact  so  that  he  may  rephrase. 
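
The way the number of parsings determines the next step can be sketched in Python. The grammar here is a deliberately tiny stand-in (a set of admissible part-of-speech sequences named `TOY_GRAMMAR`), not the context-free grammar of the model; `parsings` and `next_step` are hypothetical names.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Toy stand-in for the system grammar: admissible part-of-speech sequences.
TOY_GRAMMAR = {("noun", "verb"), ("verb", "noun"), ("noun", "verb", "noun")}


def parsings(words: List[str], lexicon: Dict[str, Set[str]]) -> List[Tuple[str, ...]]:
    """Return every class assignment of the kernel that the toy grammar accepts."""
    options = [sorted(lexicon[word]) for word in words]
    return [tags for tags in product(*options) if tags in TOY_GRAMMAR]


def next_step(words: List[str], lexicon: Dict[str, Set[str]]) -> str:
    """Decide what happens to a kernel from the number of parsings found."""
    found = parsings(words, lexicon)
    if not found:
        return "no parsing: rephrase or complete a classification"
    if len(found) > 1:
        return "ambiguous: rephrase"
    return "accepted"
```

With "supply" listed only as a verb, the kernel "supply arrived" yields no parsing; once the noun classification is added to the lexicon, the same kernel is accepted.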

The  optimum  and  most  usual  case  is  that  exactly  one  parsing  is 
obtained. The machine uses the syntactic structure as a basis for a
semantic  interpretation  of  the  kernel.  This  interpretation  at  present 
consists  of  determining  the  activity  expressed  in  the  sentence,  the 
subject,  object,  place,  and  time  of  the  activity,  and  any  property  or 
indication  of  quality  or  quantity  associated  with  the  subject,  object, 
and  place.  These  facts  are  structured  into  a  logical  form  which  is 
stored  as  the  content  representation  of  the  kernel. 
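
One possible shape for such a content representation is sketched below in Python. The field names follow the categories listed above; the class name `ContentRepresentation`, the `properties` mapping keyed by role, and the printed predicate-argument notation are assumptions made for illustration, since the source does not give the actual logical form.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ContentRepresentation:
    """Hypothetical record for the facts extracted from one kernel."""
    action: str
    subject: Optional[str] = None
    obj: Optional[str] = None          # the object of the activity
    place: Optional[str] = None
    time: Optional[str] = None
    # Property, quality, or quantity attached to "subject", "object", or "place".
    properties: Dict[str, str] = field(default_factory=dict)

    def as_predicate(self) -> str:
        """Render the record as an illustrative predicate-argument string."""
        roles = (("subject", self.subject), ("object", self.obj),
                 ("place", self.place), ("time", self.time))
        args = []
        for role, value in roles:
            if value is None:
                continue
            prop = self.properties.get(role)
            filler = f"{prop} {value}" if prop else value
            args.append(f"{role}={filler}")
        return f"{self.action}({', '.join(args)})"
```

For the kernel "American planes bombed the bridge during July", this sketch would hold `action="bombed"`, `subject="planes"`, `obj="bridge"`, `time="July"`, and the property "American" on the subject.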

The  recovery  phase  begins  with  the  analyst  informing  the  machine 
that he wishes to make a query. The machine displays a list of
categories corresponding to the kinds of information it possesses (those
identified in the previous paragraph). The analyst enters in this
display,  via  typewriter  and  lightgun,  the  particular  things  he  wishes 
to  find  out  more  about,  with  query  indicators  on  the  categories  he 
desires the machine to fill in. For example, if he wishes to determine
the actions taken by American planes during July, he enters "planes" by
SUBJECT, "American" by PROPERTY (of subject), "July" or "during July" by
TIME, and a Q (query) by ACTION. The machine searches its store of content
representations. If any meet the specifications, they are searched in the
queried categories. The information obtained is displayed, appropriately
labeled, below the still-showing query list. (If none meet the
specifications, the analyst is informed of this.) Confronted with the new
information, the analyst may sharpen, broaden or change his query as he
likes, and the machine will recompute the information accordingly.
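
The query procedure can be sketched in Python as a filter over stored representations. This is a simplified illustration: representations are plain dictionaries, the category key `SUBJECT_PROPERTY` and the function name `answer_query` are hypothetical, and matching is by exact (case-insensitive) string equality, which is cruder than whatever the model does with entries such as "during July".

```python
from typing import Dict, List, Optional

QUERY = "Q"  # the query indicator entered in a category to be filled in


def answer_query(query: Dict[str, str],
                 store: List[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Return the queried categories of every representation meeting the query."""
    constraints = {cat: val for cat, val in query.items() if val != QUERY}
    wanted = [cat for cat, val in query.items() if val == QUERY]
    answers = []
    for rep in store:
        meets = all(rep.get(cat, "").lower() == val.lower()
                    for cat, val in constraints.items())
        if meets:
            answers.append({cat: rep.get(cat) for cat in wanted})
    return answers
```

An empty result list corresponds to the case in which the analyst is informed that no representation meets the specifications; sharpening or broadening the query amounts to calling the function again with a changed dictionary.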

The  analyst  at  this  point  may  be  satisfied  with  what  he  has  learned. 
He may, on the other hand, want to see the context of some just-learned
fact. By lightgunning any of the displayed answers, he can retrieve a
display of the original section of raw text from which the kernel
yielding that answer was derived. Editing operations then can be performed on
this displayed text to reduce it to its essential elements, which may be
stored with related portions of text. The accumulations of related text
entries themselves can be displayed and edited; commentaries or
interpretations can be inserted by the analyst; and then the result can be
printed out on- or off-line for report generation.
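
Recovering the source text and assembling a report can be sketched in Python as follows. The sketch assumes that each stored answer carries a `passage_id` linking it to the section of raw text it came from; that field, and the function names `recover_passage` and `build_report`, are hypothetical, since the source does not say how the link is kept.

```python
from typing import Any, Dict, List, Tuple


def recover_passage(answer: Dict[str, Any], passages: Dict[int, str]) -> str:
    """Return the raw text section from which the answer's kernel was derived."""
    return passages[answer["passage_id"]]


def build_report(entries: List[Tuple[str, str]]) -> str:
    """Join (edited text, commentary) pairs into a printable report."""
    sections = []
    for number, (text, commentary) in enumerate(entries, start=1):
        section = f"{number}. {text}"
        if commentary:
            section += f"\n   Comment: {commentary}"
        sections.append(section)
    return "\n\n".join(sections)
```

Editing is left to the caller here: the analyst's reduced text and any commentary are simply passed in as strings.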

AN  EXPERIMENTAL  MODEL 

The  experimental  model  under  development  will  differ  in  three 
major  respects  from  the  preliminary  model,  as  well  as  in  complexity  and 
sophistication  of  processing. 

(1) Lexical analysis procedures will be introduced to augment
lexical lookup. Rather than listing each variant form of a word, the
system will use a morphological analysis to identify stems and affixes.
The inclusion of syntactic features in the lexical entry will provide
more information on grammatical subcategorization and on co-occurrence
restrictions.

(2) The context-free parsing of kernel sentences will be replaced
by a transformational analysis. The transformational analysis will
allow more complex grammatical constructions to be handled. In addition,
relations among the sentence constituents (essentially equivalent to
kernels) will be expressed directly, and this structure can be stored
as a single content representation.

(3)  Thesaurus  and  dictionary  procedures  will  be  added  to  relate 
lexical  items  to  each  other  for  the  retrieval  of  associated  content 
representations.  Degrees  and  kinds  of  relatedness  could  be  specified 
by  the  analyst  to  set  bounds  on  retrieval  or  to  reorganize  the  stored 
information. 
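
Two of the planned additions, stem and affix identification in (1) and bounded thesaurus retrieval in (3), can be illustrated with a short Python sketch. Everything specific in it is an assumption: the suffix list `SUFFIXES`, the relation kinds used in the test data, and the functions `split_word` and `related_terms` are hypothetical, and plain suffix stripping is far simpler than a real morphological analysis.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def split_word(word: str, stems: Set[str]) -> Optional[Tuple[str, str]]:
    """Split a word into (stem, suffix) if the stem is listed, else return None."""
    word = word.lower()
    if word in stems:
        return (word, "")
    for suffix in SUFFIXES:
        if word.endswith(suffix) and word[:-len(suffix)] in stems:
            return (word[:-len(suffix)], suffix)
    return None


def related_terms(term: str,
                  thesaurus: Dict[str, List[Tuple[str, str]]],
                  max_degree: int,
                  kinds: Set[str]) -> Dict[str, int]:
    """Breadth-first search for terms within max_degree links of the given kinds.

    Returns each reachable term with its degree of relatedness (0 = the term).
    """
    degree = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if degree[current] == max_degree:
            continue
        for neighbor, kind in thesaurus.get(current, []):
            if kind in kinds and neighbor not in degree:
                degree[neighbor] = degree[current] + 1
                queue.append(neighbor)
    return degree
```

The `max_degree` and `kinds` arguments play the role of the analyst-specified bounds on retrieval: a query term would first be reduced to its stem and then expanded to related stems before the stored representations are searched.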